Introduction 



Although hundreds of genome-wide association studies (GWAS) have been conducted and many 
thousands of genomic regions have been implicated in recent years, these findings have not been 
translated into clinically useful risk prediction models based on genetic markers, which limits 
the potential impact on personalized disease treatment and prevention. Typically, genetic risk 
prediction models, including those used by most direct-to-customer genetic testing companies, 
are constructed based on a few significant (usually validated) Single Nucleotide Polymorphisms 
(SNPs). However, such SNPs typically account for only a small fraction of the total heritability 
and thus cannot provide satisfactory predictive capability 19 . A large fraction of the genetic 



variances is not accounted for in the genetic risk prediction because they were "missing" in the 
GWAS results 



17 



Several hypotheses have been put forward to explain the "missing heri- 
tability" [5], including epistasis [2,13 , undetected CNVs [7], rare variants |3], among others. 
But recent studies suggest that much of the missing heritability is in fact hidden among the 
numerous common genetic variants carrying only small to modest effects [8,20,27 . Through a 
polygenic risk-score analysis on GWAS data of schizophrenia (SZ) and bipolar disorder (BP), 



Purcell et al 21 suggested that the genetic risk factors for SZ and BP involve thousands of 
common SNPs of very small effect sizes. Yang et al 28 used a linear mixed model method 



and estimated that whole-genome common SNPs could explain about 45% of the human height 



variance. Lee et al 15 extended this method to human complex diseases and also found that a 



substantial fraction of the phenotypic variances could be explained by whole-genome common 



SNPs. We follow Visscher et al 26 to refer to the proportion of phenotypic variance explained 
by all common SNP as the "narrow-sense heritability", denoted as h 2 . 



These results suggest that, compared with the genetic prediction models based on only a 
few significant variants, a more sensible approach is to impose less stringent criteria on variant 
selection or even to build prediction models using whole genome variants |3|. In this way, more 
weak-effect variants can be utilized though at the expense of including many null variants. 
Following Makowsky et al |18|, we refer to this class of methods as Whole Genome Prediction 



(WGP) methods. But according to Makowsky et al 18 , even for the WGP methods, there 



is still much significant gap between the actual prediction accuracy and the genetic variance 
accounted by all common SNPs. They also investigated several parameters that affect the pre- 
diction accuracy and showed that substantial gain in prediction accuracy can be obtained by 
increasing the training sample size and inclusion of training samples that are more related to 
the testing samples. 



The results of Makowsky et al 18 suggest that, in the presence of numerous weak-effect 



variants, one of the most effective ways to improve the prediction accuracy is to recruit more 
training samples. But sample recruitment may be difficult and expensive. Alternatively, if 
another phenotype is affected by similar genetic variants that affect the phenotype of primary 
interest, it might be possible to improve the prediction accuracy by incorporating the available 
data for this correlated phenotype. In this way, the training sample size is also increased, al- 
though the additional samples are not as informative as the original ones. 



In fact, there is accumulating evidence suggesting that different human complex traits are 
genetically correlated, i.e. multiple traits share common genetic basis, which is also formally 
known as "pleiotropy" . In a systematic analysis of the open-access NHGRI catalog, 17% of 



the trait-associated genes and 5% of the trait-associated SNPs showed pleiotropic effects 22 
Vattikuti et al [25] used a bivariate linear mixed model to analyze the Atherosclerosis Risk in 
Communities (ARIC) GWAS and the results suggest the existence of significant genetic corre- 
lations between several metabolic syndrome traits, including body-mass index (BMI), waist-to- 
hip ratio (WHR), systolic blood pressure (SBP), fasting glucose (GLU), fasting insulin (INS), 



fasting trigylcerides (TG), and fasting high-density lipoprotein (HDL). Lee et al 16 extended 
this bivariate linear mixed model so that it could deal with binary traits, i.e. human diseases. 
Andreassen et al fl] applied a "pleiotropic enrichment" method on GWAS data of schizophre- 
nia and cardiovascular-disease and showed that the power to detect schizophrenia-associated 
common variants can be improved by exploiting the pleiotropy between these two phenotypes. 
More recently, a study on genome-wide SNP data for five psychiatric disorders in 33,332 cases 
and 27,888 controls identified four significant loci (P < 5 x 10~ 8 ) affecting multiple disorders, 
including two genes encoding two L-type voltage-gated calcium channel subunits, CACNA1C 



and CACNB2 23 



All the evidence is particularly exciting because they imply that pleiotropy widely exists 
among human complex diseases and hence leveraging pleiotropy between correlated pheno- 
types might be a promising strategy to improve genetic risk prediction in the future. Indeed, 



pleiotropic SNPs are drawing more interests in association analyses 12 14 , yet not much at- 
tention has been paid to their utility in genetic prediction. Here, we present a systematic 
analysis of the utility of pleiotropy in genetic prediction by investigating the gain of prediction 
accuracy as a function of the strength of genetic correlation between two traits. We also ex- 
amine the effect of several other parameters such as the h 2 , the training sample size and the 
number /proportion of causal SNPs. 



Results 

Real Data Analysis 

We analyzed a GWAS data set for bipolar and related disorders (BARD) and schizophrenia 



(SZ) that we obtained from the dbGaP database (https://dbgap.ncbi.nlm.nih.gov). After our 
pre-processing (see Methods), the data set consists of 2,995 individuals (646 cases for BARD, 
1,159 cases for SZ and 1,190 controls) and 298,801 SNPs. In order to organize two case-control 
data sets for the two diseases, we randomly split the controls into two disjoint sets of individ- 
uals and merge each set into one of the diseases. We split the control individuals so that the 
proportions of cases for the two data sets were approximately equal, i.e. we always assign 426 
controls to BARD and 764 controls to SC. The random splitting was repeated 100 times. 

For the 100 repeats, the mean estimated genetic correlation between the two traits was 0.79 
(s.d. = 0.18). But we failed to estimate the h 2 of for the two phenotypes because most of the 



time the estimates of the observed scale h 2 15 were on the boundary of the parameter space. 



We then used a bivariate linear mixed model-based (BVLMM) method to jointly predict the 
genetic risks for BARD and SZ using this data. Prediction performance was evaluated through 
five-fold cross-validation, i.e., the data were splitted into five equal-sized folds and four of 
them were used as the training data and the other one was used as the testing data each 
time. We calculated the mean AUC (area under the ROC curve) for the five folds to measure 
the prediction performance. For comparison, we also evaluated the prediction performance of 
three other prediction methods (ridge regression, support vector machine and LASSO) that 
treat each data set separately. The results are shown in Figure 1. For BARD, the AUC for 
BVLMM is about 3% higher than ridge regression, 2% higher than SVM and 4.8% higher 
than LASSO. As for SZ, the advantage of BVLMM over ridge regression and SVM is less 
noticeable (only 0.7% and 0.3%), although its improvement over LASSO is more significant 
(7%). The BVLMM method we used can be considered as a bivariate extension of ridge 
regression that takes into account the genetic correlation between the two diseases. Therefore 
the most interesting comparison is between BVLMM and ridge regression. BVLMM achieved a 
substantial improvement in BARD but only a minor improvement in SZ. Some possible reasons 
are: 1) the shared causal SNPs have larger effect sizes in SZ than in BARD; 2) the SZ data 
has a larger sample size. In both cases, more information can be borrowed from SZ to BARD. 
Therefore jointly modeling gives more benefit to BARD than SZ in the prediction. The poor 
performance of LASSO on this data is probably due to the mis-match between the sparsity 
assumption of LASSO and the actual genetic architectures of the two diseases, i.e. LASSO is 
a sparse penalty method favoring the scenario where there is only a small number causal SNPs 
whereas it is very likely that there are many causal SNPs responsible for the two diseases. 
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Figure 1: Prediction accuracy of different methods the BARD and SZ data. 



To investigate the effect of genetic correlation between BARD and SZ in the real data, 
we randomly chose 33%, 67% and 100% of the SNPs and permuted their identities among 
the BARD cases and controls while keeping the SNP identities in the SZ samples. This will 
lead to reduced genetic correlations between BARD and SZ. We applied the BVLMM method 



after this procedure and compared the results with ridge regression (Figure 2). As more SNPs 
were permuted, the prediction accuracy of BVLMM gradually decreased and eventually became 
similar to ridge regression when all the SNPs were permuted. 
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Figure 2: Prediction accuracy of BVLMM after permuting the SNP identities and of ridge regression. Red 
plots represent the results of BVLMM and blue plots represent those of ridge regression. The percentages below 
the red plots are the fractions of permuted SNPs. 



Simulation Studies 

We simulated data sets for two diseases to investigate the utility of genetic correlation in ge- 
netic risk prediction. We simulated a total of p = 20, 000 SNPs. We varied the sample size 
N from 1,000 to 4,000. The number of causal SNPs m was also varied from 1,000 to 4,000. 
We assumd the two traits had the same number of casual SNPs. We varied the proportion of 
the causal SNPs that were shared between the two diseases 7 from to 1 to reflect different 
levels of genetic correlation. We assumed that the shared causal SNPs had a correlation of 0.8 
between their effect sizes on the two diseases. Our empirical results suggest that the realized 
genetic correlation is approximately O.87 under this setup. We simulated two h 2 levels (on 
liability scale): 0.3 and 0.6. Disease prevalence was set to be 0.05 and equal numbers of cases 
and controls were drawn from the simulated population. The simulations were repeated for 
50 times in each scenario. In each repeat, a training set and a testing set were independently 
generated and they were used to evaluate the performances of BVLMM and ridge regression. 



We first considered the scenario where both diseases had the same number of samples 
and the same h 2 level. The results for m = 2000 are shown in Figure 3 and the results for 
m = 1000 and m = 4000 are shown in the supplementary material (Figure SI and Figure S2). 
Not surprisingly, in all cases, the prediction performance of BVLMM improved as the genetic 
correlation increased. More importantly, the performance of BVLMM was at least comparable 
with that of ridge regression regardless of the level of genetic correlation between the two traits, 



suggesting that integrating the data for two diseases has negligible, if any, negative effects on 
the predictive capability even when the two diseases are completely uncorrelated. The slight 
decrease of AUC, if any, when the two diseases were uncorrelated might reflect cost of trying 
to estimate the genetic correlation between the two diseases. But such decrease is minor and 
almost diminishes as the sample size increases. Moreover, regardless of the sample size, BVLMM 
had significantly better performance than ridge regression when the genetic correlation was 0.6 
or higher. 
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Figure 3: Simulation results for the case of m = 2000 and h 2 for the two diseases are the same. The box plots 
on the left are for h 2 = 0.6 and the box plots on the right are for h 2 = 0.3. Given an h 2 level, there are three 
sample sizes: 1000, 2000 and 4000. For each sample size, the proportion of shared causal SNPs, 7 is varied from 
0,0.25,0.5,0.75 to 1. The numbers below the blue box plots (ridge regression) are the sample sizes. Following 
the blue box plots are the red box plots corresponding to the results of BVLMM with the same sample size. 
The 7 values are under the respective red box plots. 



We also examined the case when the sample sizes (Figure 4, Figure S3 and Figure S4) or the 
h 2 levels (Figure 5, Figure S5 and Figure S6) are different between the two traits. Compared 
with the scenario where the h 2 levels and sample sizes are the same for the two traits, the 
prediction of the trait with the smaller sample size gained more benefit than the trait with 
the larger sample size, which echoes with the results of the real data analysis. In addition, the 
prediction of the lower-heritability disease also enjoyed larger improvement than the higher- 
heritability disease. 



m = 2000 ,N = 4000(left) and N = 2000(right) 
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Figure 4: Simulation results for the case of m — 2000 and sample sizes differ for the two diseases. The symbols 
are similar to those of Figure 3. The box plots on the left are for the large-sample disease (N — 4000) and the 
box plots on the right are for the small-sample disease (TV = 2000). 

Discussion 



Recently, several studies 16,25 have proposed to use the bivariate linear mixed model to es- 



timate the genetic correlation between phenotypes. Following this framework, we performed a 
systematic investigation of the utility of pleiotropy in genetic risk prediction using a bivariate 
linear mixed model approach. In both real data and simulated data, we have demonstrated 
that the predictive capability of whole genome prediction (WGP) methods can be improved by 
leveraging the pleiotropy between two diseases when there exists substantial genetic correlation 



between them. To our knowledge, Hartley et al 10, 11 was the only study that has tried to 



utilize pleiotropy in genetic risk prediction. However, there are two differences between our 
study and Hartley et al. Firstly, Hartley et al focused on methodological development whereas 
our study is a systematic analysis of the usefulness of pleiotropy in genetic prediction. Secondly, 
the input data for the their method has a structure that is different from the one that we are 
interested in. Hartley et al's method requires that all the phenotypes of interest are observed 
on all individuals. But we are more interested in the case where two phenotypes are observed 
on two disjoint set of individuals. This is because in the latter case we can expect much more 
improvement of the predictive ability since the sample size is effectively increased by combining 
the data for the two phenotypes whereas there is no change in the sample size in the former 
case. Another reason is that when dealing with the case-control data with ascertainment from 
the population and, it would be inappropriate to treat the cases for one disease as controls for 
another disease. This may cause some shared signal between the two diseases to be cancelled 
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Figure 5: Simulation results for the case of m = 2000 and h 2 values differ between the two diseases. The 
symbols are similar to those of Figure 3. The box plots on the left are for the higher- heritability disease 
(h 2 = 0.6) and the box plots on the right are for the lower-heritability disease (h 2 = 0.3). 



out. 



The method that we used to jointly predict two phenotypes is based on the bivariate linear 
mixed model and also can be considered as a bivariate extension to the ridge regression. We 
note that it is very likely that better methods exist for joint prediction of multiple phenotypes, 
as in the single phenotype case SVM does better than ridge regression. But by comparing 
the results for BVLMM and that of ridge regression, we can demonstrate that there is indeed 
benefit to combine two genetically correlated phenotypes. Development of better methods that 
jointly predict multiple phenotypes is the future task. 



Methods 



Data preparation 

The subjects were genotyped on the Affymetrix Genome- Wide Human SNP Array 6.0 platform. 



See http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study. cgi?studyJd=phs000021.v3.p2 and 



http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study. cgi?study_id=phs000017. v3.pl for more 



details. We collected the genotype data for subjects with European-American ancestry. We 
estimated their pairwise genetic relationships according to Yang et al 28 and retained 2,995 



individuals (646 cases for BARD, 1,159 cases for SZ and 1,190 controls) with estimated pair- 



wise relationships smaller than 0.05. SNPs with missing rate greater than 0.01 and minor allele 
frequency smaller than 0.05 were excluded. The remaining missing genotypes were randomly 
drawn from binomial distributions based on the allele frequencies at the given loci. SNPs that 
failed the Hardy- Weinberg Equilibrium test (P < 0.0001) in either BARD, SZ or control group 
were also excluded. We also performed linkage-disequilibrium pruning so that no pair of SNPs 
within a 50-SNP window has an R-squared greater than 0.8. After these procedures, 298,801 
SNPs were retained. 

The bivariate linear mixed model 

Let y' 1 ) and y^ 2 ^ be the two vectors of two phenotypes measured on m and rii individuals, 



respectively. They are modeled with a standard bivariate linear mixed model 24 
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where the means of the two phenotypes and other fixed effects are included the design matrices 
X^ 1 ) and X^ 2 ); (3^ and (3^ are their respective effect sizes, g 1 - 1 ) and g^ 2 ^ are the genetic values 
for the two phenotypes. e^ 1 ) and e^ 2 ^ are the residuals. The genetic values g 1 - 1 ^ and g^ 2 - 1 are 
treated as random effects: 
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where G^ is the standardized genotype matrix for a total of p SNPs of the samples for the t-th 
phenotype. Specifically, let B and b be the two alleles at the j-th locus and fj be the frequency 
of the B allele, then gJ } takes a value of -2f j / y /2f j {l-f j )p, (1 - 2f j )/ y /2f j {l - Jjjp or 
2(1 — /j)/y2/j(l — fj)p if the genotype of the z-th individual at the j-th locus is bb, Bb or BB, 
respectively, u- and u- are the random effects of the j'-th locus for the two phenotypes and 
p g measures the strength of genetic correlation between the two phenotypes. We also assume 
the residuals follow a normal distribution: 
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We use the average information restricted maximum likelihood (AI-REML) algorithm M 
to estimate the five variance parameters cr^, <r 2 2 , p g , a^, a^ 2 as well as the fixed effects /3^\ 
(3^ from the training data. Then we use the estimated parameters to construct the predictive 
models. The predictive models are constructed directly based on the "best linear unbiased 
predictors" (BLUP) of the random effects u^ 1 ', u^ and the "best linear unbiased estimates" 
(BLUE) of the fixed effects /3«, (3^. We denote the them as u«, u^ 2 ) and $^\ ft 2 \ Let t s 



and E e be the REML estimates of the variance parameters. The BLUP for the random effects 
can be written out as: 
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and ® standards for the Kronecker product. In fact, the BLUP can 



be considered as a shrunk estimator of the genetic effects where the shrinkage level is controlled 
by the empirical estimates of the genetic variance and residual variance. The BLUP gives the 
best unbiased prediction when the samples are generated from a Gaussian linear model; however 
it may not be a good predictor in other cases, e.g. case-control data. To make our predictive 
model more flexible, we introduce a factor A to control the shrinkage level. Specifically, given 
a A, the estimates of the random effects are: 
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Then given a set of testing individuals with covariates X^ test ^ and standardized genotype matrix 
Q(test)^ ^g p rec ii c ted values for the two phenotypes are: 



y (l) = X (t«t)^(l) + G (test ){i (l) )y (2) = x (test) / 3(2) + G (teet) fl (2) 



(10) 



In our analysis, we let A = 1 in the simulations for quantitative traits because in such cases, 
BLUP gives the best unbiased prediction accuracy. In other cases (e.g. binary trait), we chose 
the A that gives the best prediction accuracy. 



Simulation Setup 

For quantitative traits, we first simulated genotypes for N individuals on p SNPs. The minor 
allele frequencies (MAF) for the p SNPs were drawn uniformly form [0.05,0.5]. Then the 
genotypes for the N individuals at a given SNP were drawn from a binomial distribution 
according the its MAF. We randomly chose m SNPs to be causal SNPs. Given a MAF fj the 

effect size of a SNP was simulated from a normal distribution with mean zero and variance of 

h 2 



- — , , , . — -r^—. The environmental effects were simulated from a standard normal distribution 
(mean zero and variance one). In this way, we can control the h 2 . When two traits are jointly 
simulated, the two data sets should share the same set of SNPs. Furthermore, m' causal SNPs 
{m! < m) were randomly chosen to be causal in both traits and their effect sizes for the two 
traits were simulated with a correlation coefficient of 0.8. Therefore, the genetic correlation 
between the two traits can be controlled by m' . 

For binary traits, we followed the classical liability threshold model (6) to simulate the 
genotypes and phenotypes. Specifically, given a desired sample size N, the proportion of cases 
in the data P and the disease prevalence K, genotypes of a cohort of at least NP/K individuals 
were generated in the same way as in the case of quantitative traits. Similar to the quantitative 
traits, m SNPs were randomly chosen to be causal and an underlying quantitative trait - the 



10 



liability- was simulated given a desired liability-scale heritability h\. Once the liabilities were 
simulated, the quantile that corresponds to the disease prevalence K was set as threshold for 
assigning disease status labels. Desired numbers of cases and controls were randomly drawn 
from individuals with liabilities greater and smaller than the threshold respectively. Also similar 
to the quantitative traits, w! causal SNPs shared between the two traits were randomly chosen 
with effect size correlation of 0.8. 
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